Define observability requirements for stable components #11772
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@           Coverage Diff           @@
##             main   #11772      +/-   ##
==========================================
- Coverage   91.53%   91.45%   -0.08%
==========================================
  Files         443      447       +4
  Lines       23766    23721      -45
==========================================
- Hits        21754    21694      -60
- Misses       1638     1653      +15
  Partials      374      374

☔ View full report in Codecov by Sentry.
Thanks! Left a few comments. I think we also want to clarify this is not an exhaustive list: components may want to add other telemetry if it makes sense
docs/component-stability.md (outdated)
1. How much data the component receives.
   For receivers, this could be a metric counting requests, received bytes, scraping attempts, etc.
What would this metric measure for the hostmetrics receiver? Would the measurement be different depending on the OS?
I think "scraping attempts", or maybe "number of metric points" scraped could make sense for the hostmetrics
receiver, and those would not need to differ based on the OS.
I don't think we need to be much more specific than that, as the "data received" metrics for receivers only really seem useful if you also have a comparable "data lost" metric.
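For illustration, a scraper-style receiver could report such a "data received" measurement from its own telemetry settings. This is only a sketch under assumptions: the scope name, metric name, and helper functions are made up and not part of any existing receiver.

```go
package examplereceiver

import (
	"context"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/pdata/pmetric"
	"go.opentelemetry.io/otel/metric"
)

// scrapeCounter counts how many metric data points each scrape produced.
type scrapeCounter struct {
	scrapedPoints metric.Int64Counter
}

func newScrapeCounter(set component.TelemetrySettings) (*scrapeCounter, error) {
	meter := set.MeterProvider.Meter("examplereceiver")
	counter, err := meter.Int64Counter(
		"receiver_scraped_metric_points",
		metric.WithDescription("Number of metric data points produced by scrapes."),
		metric.WithUnit("{datapoint}"),
	)
	if err != nil {
		return nil, err
	}
	return &scrapeCounter{scrapedPoints: counter}, nil
}

// record is called after each successful scrape.
func (c *scrapeCounter) record(ctx context.Context, md pmetric.Metrics) {
	c.scrapedPoints.Add(ctx, int64(md.DataPointCount()))
}
```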
docs/component-stability.md (outdated)
Stable components should emit enough internal telemetry to let users detect errors, as well as data loss and performance issues inside the component, and to help diagnose them if possible.

The internal telemetry of a stable component should allow observing the following:
The text seems to exclude extensions; I would clarify it like this:
- The internal telemetry of a stable component should allow observing the following:
+ The internal telemetry of a stable pipeline component should allow observing the following:
Actually, I hadn't thought about extensions at all. Do you think some of this would be applicable to them?
Description
This PR defines observability requirements for components at the "Stable" stability level. The goal is to ensure that Collector pipelines are properly observable, to help in debugging configuration issues.
Approach
Feel free to share if you don't think this is the right approach for these requirements.
Link to tracking issue
Resolves #11581
Important note regarding the Pipeline Instrumentation RFC
I included this paragraph in the part about error count metrics:
> The Pipeline Instrumentation RFC (hereafter abbreviated "PI"), once implemented, should allow monitoring component errors via the `outcome` attribute, which is either `success` or `failure`, depending on whether the `Consumer` API call returned an error.

Note that this does not work for receivers, or allow differentiating between different types of errors; for that reason, I believe additional component-specific error metrics will often still be required, but it would be nice to cover as many cases as possible automatically.
However, at the moment, errors are (usually) propagated upstream through the chain of `Consume` calls, so in case of error the `failure` state will end up applied to all components upstream of the actual source of the error. This means the PI metrics do not fit the first bullet point.

Moreover, I would argue that even post-processing the PI metrics does not reliably allow distinguishing the ultimate source of errors (the second bullet point). One simple idea is to compute `consumed.items{outcome:failure} - produced.items{outcome:failure}` to get the number of errors originating in a component. But this only works if output items map one-to-one to input items: if a processor or connector outputs fewer items than it consumes (because it aggregates them, or translates them to a different signal type), this formula will return false positives. For example, a connector that consumes 100 spans, emits 10 aggregated metric points, and then gets an error back from downstream would show `consumed.items{outcome:failure} = 100` and `produced.items{outcome:failure} = 10`, i.e. 90 apparent errors that never happened. If these false positives are mixed with real errors from the component and/or from downstream, the situation becomes impossible to analyze by just looking at the metrics.

For these reasons, I believe we should do one of four things:
- Change the `Consumer` API to no longer propagate errors, making the PI metric outcomes more precise. We could catch errors in whatever wrapper we already use to emit the PI metrics, log them for posterity, and simply not propagate them. Note that some components already more or less do this, such as the `batchprocessor`, but this option may in principle break components which rely on downstream errors (for retry purposes, for example).
- Make the PI metrics distinguish where an error originated (for example by adding another `outcome` value, or another attribute). This could be implemented by somehow propagating additional state from one `Consume` call to another, allowing us to establish the first appearance of a given error value in the pipeline.
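To make the first option above more concrete, here is a minimal sketch under assumptions: `obsConsumer`, the counter, and the logging behavior are illustrative, not the actual PI implementation. The wrapper records an items counter with an `outcome` attribute, logs any downstream error, and does not propagate it upstream.

```go
package obsconsumersketch

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	"go.uber.org/zap"
)

// obsConsumer wraps the next traces consumer in the pipeline.
type obsConsumer struct {
	next   consumer.Traces
	logger *zap.Logger
	items  metric.Int64Counter // items counter in the spirit of the PI metrics
}

func (c *obsConsumer) Capabilities() consumer.Capabilities {
	return c.next.Capabilities()
}

func (c *obsConsumer) ConsumeTraces(ctx context.Context, td ptrace.Traces) error {
	err := c.next.ConsumeTraces(ctx, td)
	outcome := "success"
	if err != nil {
		outcome = "failure"
		// Log the error for posterity instead of returning it upstream.
		c.logger.Error("downstream consumer failed", zap.Error(err))
	}
	c.items.Add(ctx, int64(td.SpanCount()),
		metric.WithAttributes(attribute.String("outcome", outcome)))
	// Swallow the error so upstream components do not record a spurious failure.
	return nil
}
```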